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Abstract 

The outcome of a functional genomics pipeline is usually a 
partial list of genomic features, ranked by their relevance in 
modelling biological phenotype in terms of a classification or 
regression model. Due to resampling protocols or just within 
a meta-analysis comparison, instead of one list it is often the 
case that sets of alternative feature lists (possibly of differ- 
ent lengths) are obtained. Here we introduce a method, based 
on the algebraic theory of symmetric groups, for studying the 
variability between lists ("list stability") in the case of lists 
of unequal length. We provide algorithms evaluating stability 
for lists embedded in the full feature set or just limited to the 
features occurring in the partial Usts. The method is demon- 
strated first on synthetic data in a gene filtering task and then 
for finding gene profiles on a recent prostate cancer dataset. 

1 INTRODUCTION 

Defining indicators for assessing ranked fists variability has 
become a key research issue in functional genomics CT] |2] [3] 

ilia. 

In 161, a method is introduced to detect stability (homogene- 
ity) of a set of ranked lists of biomarkers emerging as out- 
put of feature selection algorithm during a molecular profiling 
task. The stability indicator relies on the application of met- 
ric methods for ordered data viewed as elements of a suitable 
permutation group: foundations of such theory can be found 



in Q [U and it is based on the concept of distance between 
two lists; in particular, the employed metric is the Canberra 
distance as discussed in |j91. The mathematical details of the 
stability procedure are described in ifTOlfTTl : given a set of or- 
dered lists, the basic mechanism is to evaluate the degree of 
self-homogeneity of a set of ordered lists through the compu- 
tation of all the mutual distances between the elements of the 
original set. Moreover, by using the location parameter, the 
Canberra distance can also be computed between upper par- 
tial lists of the original complete lists, the so called top-fc lists 
lfT2l . formed by their k best ranked elements. The method 
proposed in has a main drawback limiting its application 
in many practical situations: the studied lists are required to 
have the same length. As complete lists, they all share the 
same elements, with only their ordering being different and, 
when considering partial top-fc lists, the same k must be cho- 
sen for all sublists. 

This is usually not the case when investigating the outcomes 
of profiling experiments where the employed feature ranking 
method does not produce a rank for every involved feature, 
but it just scores the best performing ones, thus leading to 
the construction of lists with different lengths. Some work 
towards partial lists comparison has recently appeared in lit- 
erature, both for general contexts IfTSlI and more focussed on 
the gene ranking case lfT4l[T5l[T6l . but they all consist of set- 
theoretical measures. 

In the present work we propose an extension of the method 
introduced in ||6l to mend this flaw by allowing computing 
(Canberra) distance for two lists of different length, defined 
within the framework of the metric methods for permutation 
groups. The novel approach is based on the use of quotients 
of permutation groups. The key formula can be split into two 
main components: one taking care of the elements occurring 
in the selected lists, and the second one considering the re- 
maining elements of the complete set of features the exper- 
iment started from. In particular, this second component is 
independent from the positions of the selected elements in the 
lists: neglecting this part, a different stability measure (called 
the core component of the complete formula) is obtained. An 
applications of the described methods can be found in IfTTl . 

After having detailed the central algorithm in Section [21 
applications to synthetic and genomics datasets and different 
machine learning tasks are discussed in Section [3l The de- 
scribed algorithm is publicly available within the Python pack- 
age mlpy (h ttps : / /mlpy . fbk . eu) for statistical ma- 
chine learning. 
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2 MATERIALS AND METHODS 

2.1 Introduction 

The procedure described in f6l| is made of two separate parts, 
the former concerning the computation of all the mutual dis- 
tances between the (complete or partial) lists, and the latter the 
construction of the matrix starting from those distances and of 
the indicator coming from the defined matrix. This second 
phase is independent from the length of the considered lists: 
the extension shown hereafter only affects the previous step, 
i.e. the distance definition. The algorithm relies on applica- 
tion of metric methods for ordered data viewed as elements of 
a suitable permutation group: foundations of such theory can 
be found in ifTSl [191 |2l EJ and it is based on the concept of 
distance between two lists; in particular, the employed metric 
is the Canberra distance ||9l- The mathematical details of the 
procedure are shown in ifTOlfTTI . The described algorithm de- 
fines a measure to evaluate the degree of self-homogeneity of 
a set of ordered lists through the computation of all the mu- 
tual distances between the elements of the original set. More- 
over, the distance can also be computed between upper par- 
tial lists of the original complete lists, the so called top-A: lists 
ifTSl . formed by their k best ranked elements. The indicator 
introduced in |6) has a main drawback limiting its applica- 
tion in many practical situations: the studied lists are required 
to have the same length. In fact, as complete lists they must 
share the same elements, with only their ordering being differ- 
ent and, when considering partial top-fc lists, the same k must 
be chosen for all sublists. This is usually not the case when 
investigating the outcomes of profiling experiments where the 
employed feature ranking method does not produce a rank for 
every involved feature, but it just scores the best performing 
ones, thus leading to the construction of lists with different 
length. Some work towards this task has recently appeared in 
literature, both for general contexts 111 3 J and more focussed 
on the gene ranking case lfT4l [TSl [T6l . but they all consist of 
set-theoretical measure. In the present work we propose an 
extension of the method introduced in |6i to mend this flaw by 
allowing computing (Canberra) distance for two lists of dif- 
ferent length, still defined within the framework of the metric 
methods for permutation groups. 

2.2 Notations 

Let T = {i^j}j=i,....p be a set of p features, and let L be 
a ranked list consisting of I elements extracted (without re- 
placement) from J". If L = (Fli , Fl^ , . . . , F^, ), let T{j) be 
the rank of Fj in L (with t{Fz) ^ if F^ ^ L) and define 
= {'''{i))i=i r> '^he dual list of L. Consider the set Sl 



of aU elements of the symmetric group Sjr on T whose top-Z 
sublist is L: Sl has (p — /)! elements and it is isomorphic to 
(a coset of) Sp^i. Finally, let Sr be the set of all the dual lists 
of the elements in Sl- if a G Sr, then a{i) = T(i) for all 
indexes i E L. Thus Sr consists of the {p — \L\)\ (dual) per- 
mutations of Sp coinciding with t on the elements belonging 
to L. Furthermore, note that t{L) = {!,..., 

2.3 Shorthands 

If Hs is used to denote the s-th armonic number defined as 
Hs = then we can define some useful shorthands: 



A(a,&,c)= 



\c-j\ 

c + i 



a<i<b 

'b-a + l- 2c{Hb+c - Ha+c-i) 

if c < a 

2c{H2c - Ha+c-i - Hb+c - 1) + 6 + a - 1 

if a < c < b 

2c{Hb+c - Ha+c-i) ~b + a -1 

if c> 6 , 



and 



_ is-k){s + k + l) 



k{k + l) 

J^s+k+l H 7i 



+ 



s{2k - s - 1) 



1 

S + -) H,s+,--Hs 



2s^ + .s + 1 



Finally, 



e(a,/3,7)= E E 



\u — v\ 



= E A(/3,7,^), 



u + V 

with 8(a, /3, 7) = 0,7). Details on harmonic number 
can be found in 1201 . while some new techniques for dealing 
with sums and products of harmonic numbers are shown in 
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2.4 Canberra distance on permutation groups 

Given two real-valued vectors x, y G M", their Canberra dis- 
tance 191 is defined as 



Ca(x,y) = ^ 



+ \yi 



This metric can be naturally extended to a distance on permu- 
tation groups: for t,<t E Sp, we have 

^ Til) + ail) 

The expected (average) value of the Canberra metric on the 
whole group Sp can be computed as follows: 



E{CaiSp)} = -^ J2 Ca(a,r) 



\S, 



^ ^ Ca(a,IdsJ 



p! -Tt; + * 



(1) 



2n + 2 + — lf2„ 
2n J 

2n + 2 + l-)K,.-L+l 



2.5 Canberra Distance on Partial Lists 

If Li and L2 are two (partial) lists of length respectively 
li < I2 whose elements belong to J^, and d is a distance on 
permutation groups, define the distance between Li and L2 as 

, L2) = / ({d(a, l3):aeSr,,l3eSr,}) 

= .fidiSTi,Sr2)) , 

for / a function of the (p — — ^2)! distances (3) such 
that on a singleton t, fi{t}) = t. Note that if Li and L2 are 
complete lists, the above definition coincides on complete lists 
with the usual definition of distance between complete lists 
given in |6||. Moreover, being d a distance, the smaller the 
value the more similar the compared Usts. A natural choice 
for the function /, motivated also from the fact that the many 
distances for permutation groups (and we proved this is the 
case for the Canberra distance in ifTTHTOl ) are asymptotically 
normal Il30l . is the mean function, so that 

diLuL2)= , E E'^^"'/^)- (2) 



I ^Ti I ■ I I 



We note that this definition differs from the one first intro- 
duced in |f6l| because the relation between the size of ac- 
tually used features and the size the original feature set is 
taken into account here. This is relevant while performing 
genomic profiling experiments. Consider the decomposition 
of the set J' into the three disjoint sets (ignoring the fea- 
tures' rank) F12 = Li n L2 , Fj^ = J" \ (Li U L2) and 
Fi/2 = (ii U L2) \ (£1 n L2). Then, if d = Ca is the Can- 
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berra distance and A ~ - — 

ip ~ hy-ip - hV- 

split as follows into three terms: 



, the Eq. © can be 



Ca(Li,L2) = ^^ E E 

^ ^ aii)+l3ii) ' 

a{i)—Ti{i) if i ^ Li 
l3{i)=T2(i) if i G L2 



= A E E 
= AE E 

-A E 



aii) + /3ii) 
\aii)-m 



aii) + /3ii) 



E 



F,eFi2UFi/2Ui=V2(a,/3)GS^i xSr^ 

+ E E 

FieFi/2 {a,p)eS^^xS^ 

+ E E 

FieF-r2(Q,/3)es,iXS,2 



aii) 




\aii) 


-mi 


aii) 


+m 




-m\' 



aii)+ Pii) 



(Tl) 
(T2) 

(T3) 



Expanding the three terms Tl, T2, T3 a closed form for the 
distance can be reached: 



Definition 1. Complete Canberra Distance. The Complete 
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Canberra Distance (between partial lists) is defined as 

_ A(^2 + l,P,Ti(i)) 



(3) 
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p-h 

+ -^{h{p-h) + ^ei,{h)+m2) 
p - k 

- 2^ih) - 2ei,il2) - 2epil2) 
+ {p + li){l2 - h) + hih + I) 

+ A ■ i2^{p) - 2ah) - 2ei, (p) + 2ei, [h) 

- 2ep{p) + 2ep{l2) + {p + li){p - I2) 

+ l2{l2 + l)~p{p+l)) , 



where A = 



\T\{Li\JL2) 



The sum generating the term 



{p-h){p~l2) 

T3 in Eq. (ITll) runs over the subset Fj^ collecting all elements 
of the original feature set which do not occur neither in the first 
list nor in the second. Thus this part of the formula is indepen- 
dent from the positions of the elements occurring in the partial 
lists Li, L2- Neglecting this term, we obtain another measure 
of list difference: 

Definition 2. Core Canberra Distance. The Core Canberra 
is defined as the components Tl, T2 of the Complete Canberra 
Distance depending on the positions of the elements in the 
considered partial lists. This corresponds to setting ^ = in 
Eq. ©ofDef.IU 

Throughout the paper, the values of both instances of the 
Canberra Distance are normalized by dividing them by the ex- 
pected value E{Ca.{Sp)} on the whole permutation group Sp 
reported in Eq. ([T]i. This would result in two random (com- 
plete) lists having a Complete Canberra Distance very close to 
one; note that, since the expected value is not the highest one, 
distance values greater than one can occur When the number 
of features in not occurring in Li, L2 becomes larger, the 
non-core component gets numerically highly preeminent: in 
fact, in the term T3 all the possible {p — 11)1 {p — 12)1 lists in Sp 
having Li and L2 respectively as top lists are considered; as an 
example, forp = 10000 and Li, L2 two partial lists with 100 
elements, this corresponds in evaluating the distance among 
9900!^ ~ 2.2 • 10^519 lists of elements not occurring in Li, 



Lisls 


Disl. 


c = 2 


c = 3 


c = 4 


c = 5 


Idcnliciil 


Comp. 


0.692830 


0.960499 


0.995858 


0.999583 


Random 


Core 


0.078038 


0.006368 


0.000950 


0.000109 


Riindom 


Comp. 


0.770868 


0.966867 


0.996809 


0.999692 


Max.Disl. 


Core 


0.128448 


0.012665 


0.001265 


0.000126 


Max.Disi. 


Comp. 


0.821278 


0.973164 


0.997123 


0.999709 



Table 1 : Core and Complete normalized Canberra distance for 
two partial lists of 10 features extracted from a set of \J^\ = 
features. The partial lists are either identical, randomly 
permuted (average distance) or maximally distant. The Core 
Distance for Identical partial lists is zero. 



L2- When the number of lists of unselected elements grows 
larger, the average distance among them would get closer to 
the expected value of the Canberra distance on Sp because of 
the Hoeffding's Theorem. 

This is quite often the case for biological ranked lists: for 
instance, selecting a panel of biomarkers from a set of probes 
usually means choosing less than hundred of features out of an 
original set of several thousands. Thus, considering the Core 
component instead of the Complete Distance can be helpful in 
term of dimensionality reduction of the considered problem. 

As an example, in Tab. [T]we show the values of the normal- 
ized distances of two partial lists of length 10 extracted from 
an original set T with p — features (c = 2, 3, 4, 5), in the 
three cases of identical partial lists, randomly permuted par- 
tial lists (which yields average distance) and maximally dis- 
tant partial lists (see ifTOl [TTI for the identification of the per- 
mutation maximizing the Canberra distance between lists). A 
further observation can be derived from Fig. [T] where the ra- 
tio between Core and Canberra distances are plotted versus the 
ratio between the length of partial lists and the size of the full 
feature set for about 7000 instances of couples of partial lists 
of the same length randomly permuted. When the number of 
elements of the partial lists is a small portion of the total fea- 
ture, the Complete and the Core distance are almost linearly 
dependent, while when such ratio approaches one the ratio 
between the two measures follows a different function. As 
shown above, the Core measure is more convenient to better 
focus on differences occurring among lists of relatively small 
length. On the other hand, the Complete version is the elec- 
tive choice when the original feature set is large and the partial 
lists' length is of comparable order of magnitude of \ J-\. 



2.6 Expansion of Eq. dTB 

The three terms occurring in Eq. dTlb can be expanded 
through a few algebraic steps in a more closed form, reduc- 
ing the use of sums wherever possible. 
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\2 = {p — hY-ip — h — 1)!, the term can be rearranged as: 



Figure 1 : (a) Ratio between Core and Complete distances ver- 
sus ratio between the length of partial lists and the size of the 
full feature set for about 7000 instances of couples of partial 
lists. Lists pairs have the same length and they are randomly 
permuted, with partial lists length ranging between 1 and 5000 
and full set size ranging between 10 and 100000. (b) Zoom of 
the bottom left corner of panel (a): Core and Complete dis- 
tances are proportional when the ratio between the length of 
partial lists and the size of the full feature set is less than about 
0.15. 



2.6.1 Tl: coimnon features 

The first term is the component of the distance computed over 
the features appearing in both lists Li, L2, thus no complete 
closed form can be written. The expansion reads as follows: 



E E 

- E E 



\aii)-m\ 
a{i) + l3{i) 



\aii)-m\ 
a{i) + l3{i) 

InW-r^WI 



- E \SrA-\Sr,_\ ^^^^^^^^^^^ 



ieLinL2 



{p~h)\{p~h)\ 



\Ti{i) - T2{i)\ 



E 



ieLinL2 



ieLinL 
\rl{^)~r2{i)\ 

Tl{i) + T2{i) 



Tl{i) + T2{i) 



E E 

= E E 



+ /3(i) 



E Ei^^ 



- E Ei^^ 

i&L2 -i^Li a^S-r^ 



-X, V V 

^ ^ Tl(l) + 1 

i£LiA<^L2 j=l2 + l 

= Ai A(;2 + l,p,riW) 

iGLi ,i^L2 

+ X2 J2 Mh + l,P,T2{i)) 

i£L2,i^Li 

= Ai [ ^ A(/2 + l,P,Ti(i)) 

- Y A(/2 + l,p,riW)) 

+ A2 ( ^ A(?i+l,p,T2(*)) 

- J2 A(/i + l,p,T2W) J . 

FieFi2 ) 



2.6.2 T2: features occurring only in one list 

The second term regards the elements occurring only in one 
of the two lists. By defining Ai = (p — /i)!(p ~ h — 1)' and 



2.6.3 T3: unselected features 

The last term represents the component of the distance com- 
puted on the factor group, that is the elements of the original 
feature set not appearing in ii or ^2- Here a complete closed 
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form can be reached: 



2.9.1 The Tib datasets. 



E E 



Hi)~m\ 

a{i) + l3{i) 



E E 



a{i) + l3{i) 



i + j 



'l = h+l ]=l2 + l 

\F-^\ip -h- i)!(p -h- ly.eih + 1J2 + i,p) 

{p-hjip-h) 
+ 2ei, ih) - 2ep{p) + 2sp{l2) + {p + li){p- h) 

+ l2{l2 + l)-p{p+l)) ■ 



We build two datasets simulating a microarray dataset inspired 
by The datasets Tib 100 and TibSOO consist of 100 sam- 
ples and are described respectively by 100 and 500 features 
(genes). The first 50 samples were assigned to class 1, the oth- 
ers to class -1. All expression values were generated as stan- 
dard normally distributed numbers. Genes 1-20 (1-50) have 
mean 1 for samples 1-50 and mean -1 for samples 51-100. 
Initially, genes 21-100 (51-500) in all the samples have mean 
0. Then three substitutions are performed, where a percent- 
age P of all genes from the a-th to the 6-th are replaced by 
normally distributed numbers with mean m, namely: 

1. P = 40, a = 21, 6 = 40 (50) and m = -1; 

2. P = 50, a = 41 (51), 6 = 60 (150) and to = 1; 



2.7 The Borda list 

To summarize the information coming from a set of lists C 
into a single optimal list we adopt the same strategy of ||6l, 
i.e. an extension of the classical voting theory methods known 
as the Borda count ISTl [32l . This method, in its basic ver- 
sion, derives a single list from a set of B lists on p candidates 
Fi, . . . ,Fp by ranking them according to a score s{Fi) de- 
fined by the total number of candidates ranked higher than Fi 
over all lists. Our extension consists in first computing, for 
each feature Fj, its number of extractions (the number of lists 
where Fj appears) Cj = |{i G {1 . . . B} : Fj G Li}\ and its 

average position number ak{j) = — TiU) 

^"^ {ie{l...B}: Fj6LJ 

and then ranking the features by decreasing extraction number 
and by increasing average position number when ties occur. 
The resulting list will be called optimal list or Borda list. In 
ll6l the equivalence of this ranking with the Borda count is 
proved. 



2.8 Implementation 

The computation of the stability indicator for partial lists 
is publicly available (since version 1.1.2) within the Open 
Source Python package mlpy (ht tps : //mlpy . fbk . euj 
for statistical machine learning | 



2.9 Data description 

For the experiments described in the RESULTS, we used two 
datasets: a synthtetic dataset and a microarray dataset. 



3. P = 70, a = 61 (151), 6 = 70 (250) and m = 0.5. 

While only the first 20 genes are truly discriminating, the 
noisy part of the dataset is modified in order to give a partial 
discriminating power also to genes 21-70 (21-250), leaving 
only the genes 71-100 (251-500) as undiscriminative features. 



2.9.2 The Prostate Cancer dataset. 

We use the publicly available prostate cancer dataset described 
in 1361 and available from GEO (accession number GSE8402) 
built from a custom Illumina DASL Assay of 6144 genes 
known from literature to be prostate cancer related. Setlur et 
al. identified a subtype of prostate cancer characterized by the 
fusion of the 5 '-untranslated region of the androgen-regulated 
transmembrane protease serine 2 (TMPRSS2) promoter with 
erythroblast transformation-specific transcription factor fam- 
ily members (TMPRSS2-ER). As mentioned in the original 
paper, "the common fusion between TMPRESS2 and v-ets 
erythroblastosis virus E26 oncogene homolog (avian) (ERG) 
is associated with a more aggressive clinical phenotype, im- 
plying the existence of a distinct subclass of prostate cancer 
defined by this fusion". The discrimination task consists in 
separating positive TMPRSS2-ERG gene fusion cases from 
negative ones. The database includes two different cohorts of 
patients: the US Physician Health Study Prostatectomy Con- 
firmation Cohort, with 41 positive and 60 negative samples, 
and the Swedish Watchful Waiting Cohort, consisting of 62 
positive and 292 negative samples. In what follows, we will 
indicate the whole dataset as Setlur, and its two cohorts by the 
shorthands US and Sweden. 
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3 RESULTS 

Two applications are shown in the present section as practi- 
cal examples of use of the proposed method within common 
tasks in computational biology. First we outline how to use the 
Canberra distance to compare the different behaviours of sev- 
eral filtering methods on a synthetic dataset. Identifying the 
genes which are differentially expressed between two groups 
of samples is a key task in a profiling study: when the sam- 
ple size is small this may be quite tricky, since the chances of 
selecting false positives are relevant. Many algorithms have 
been devised to deal with such issue: an important family is 
represented by the filter methods, which essentially consist in 
applying a suitable statistic to the dataset to rank the genes 
in term of a degree of differential expression, and then de- 
ciding a threshold (cutoff) on such degree to discriminate the 
differentially expressed genes. Reliability of a method over 
another is a debated issue in literature: while some authors 
thinks that the lists coming from using FC ratio are more re- 
producible than those emerging by ranking genes according to 
the P- value of t-test 137] |38], others point out that f-test 
and F-test better address some FC deficiencies (e.g. ignoring 
variation within the same class) and they are recommended for 
small sample size datasets. Most researcher also agree on the 
fact that SAM gOl gl] HI |43] ED should outperform all other 
three methods because of its ability in controlling the false dis- 
covery rate. Moreover, in P31 the author show that motivation 
for the use of either FC or mod-< is essentially biological while 
ordinary t statistic is shown to be inferior to the mod-t statistic 
and therefore should be avoided for microarray analysis. In 
the extensive study P6l . also alternative methods such as Em- 
pirical Bayes Statistics, Between Group Analysis and Rank 
Product have been taken into account, applying them to 9 mi- 
croarray publicly available datasets. The resulting gene lists 
are compared only in terms of number of overlapping genes 
and predictive performance when using as features to train 
four different classifier. Here we will study the stability of 
the lists of discriminative genes produced by several filtering 
algorithms as a function of the number of samples, by evaluat- 
ing it at different values of the filtering thresholds. Our second 
application is a profiling task on a publicly available recent 
prostate cancer dataset, where we aim at detecting a panel of 
genes involved in the discrimination between patients express- 
ing or not a certain gene fusion. We use the ranked partial lists 
produced by replicated cross-validations to better character- 
ize the seeked panel and to detect differences between the two 
cohorts in the dataset. Finally, we compare the set of ranked 
lists produced by the profiling experiment with the sets of lists 
retrieved by applying the same seven filtering methods to the 
prostate cancer dataset, to show similarities and differences 



of lists obtained when looking for a discriminative predictive 
panel and when identifying differentially expressed genes. 

3.1 Gene Filtering on the Tib datasets 

The stability (i.e. robustness against input variation) of the 
gene lists produced by different filtering strategies on the two 
Tib 100 and Tib500 datasets and computed for different con- 
figurations of two parameters (the number of samples and the 
filtering threshold) is assessed through the experiment out- 
lined in the present section. By using the stability indicator 
defined, we explore the properties of 7 state-of-the-art filter- 
ing approaches in terms of homogeneity of the ordered lists of 
features identified as differentially expressed on two synthetic 
datasets. The statistics considered are Fold Change (FC) BOl . 
Significance Analysis of Microarray (SAM) BOl . B statistics 
HTl . F statistics gSl, t statistics g6|, and mod-F and mod- 
t statistics P9ll , which are the moderated version of F and 
t statistics. The FC of a given gene is defined here as the 
ratio of the average expression value computed over the two 
groups of samples. We apply the stability indicators to list 
sets L{n, A, 9) of cardinality B = 100, where 

• n is the number of samples of each class selected from 
the original dataset considered in the stability analysis 
(for each experiment we consider a subset of the origi- 
nal dataset of cardinality equal to 2n): n ranges between 
5 and 45; 

• A indicates one of the 7 filtering statistics: FC, SAM, B 
statistics, F statistics, t statistics, and mod-F and mod-t 
statistics; 

• 6* is the threshold considered for A so that a set of 100 
values was chosen for each ^ as a percentage of the A 
range. 

We indicate also as i = i{6) the number of elements of the list 
set {i < B) and for representation's clearness we will consider 
L(7i, A, 9, i). For each parameter configurations we compute 
the Core Canberra distance. All filtering statistics are com- 
puted by using the package DEDS Il50l for BioConductor fJH 
within the statistical environment R 1521 . In Fig.|2]we repre- 
sent the value of the Core Canberra distance for some of the 
values of the triplet (dataset. A, measure). A few consider- 
ation can be drawn by observing the reported images. First 
of all, three groups of different behaviours can be identified: 
F, mod-F and B group together and t, moA-t and SAM do 
the same, while FC exhibits a completely different shape. In 
both cases, the mod statistic (both F and t) belongs to the same 
group and it has a better regularization than the corresponding 
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Figure 2: Levelplot of the Core Canberra distance evaluated on the list sets A, 0, i). White pixels indicates experiments in 
which all features are discarded from all lists {i — 0). The white line separates hst sets with i = B (left side) from those with 
i < B (right side), (a)-(g): TibSOO, (h): TiblOO; (a): F, (b): mod-i^, (c): B, (d): FC, (e): t, (f): mod-i, (g)-(h): SAM. 



classic counterpart, resulting more robust in the small sample 
size case. 

The group including t, mod-i, and SAM has a small num- 
ber of void lists even for high thresholds, and the slope of its 



white line is higher: both facts indicate a not so strong de- 
pendence on the number of samples considered. On the other 
hand, when the same threshold is considered, the stability is 
higher for the group F, mod-F, and B. The levelplot |2}i for 
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FC shows that the dependence of this methods on the dataset 
sample size is opposite to the behaviour of all other methods: 
at a given threshold, the constraint imposed gets stricter for 
increasing number of samples. The darker horn-shaped area 
in the rightmost zone of the plots is probably due to the effect 
that the relevant features come in groups because of the dataset 
definition, and this is mostly evident in the small sample sizes. 
As a final consideration, plots [2^ and|2}i show that considering 
the smaller dataset Tib 1 00 instead of the larger TibSOO reflects 
in losing some details in the corresponding levelplot. 

A few more computations and considerations on the stabil- 
ity of the obtained lists are shown and discussed in the follow- 
ing subsections. 

3.2 Profiling 

The second application concerns the assessment of the stabil- 
ity of sets of gene panels derived from profiling tasks, where 
different configuration of the learning scheme (e.g. the clas- 
sifier, or the ranking algorithm) are compared. Here the task 
is how to select a list of predictive biomarkers and a classifier 
to predictively discriminate prostate cancer patients carrying 
the TMPRSS2-ER gene fusion. A basic Data Analysis Pro- 
tocol (DAP for short) is applied to both cohorts of the Setlur 
dataset, namely a stratified lOx 5-CV, using three different 
classifiers: Diagonal Linear Discriminant Analysis (DLDA) 
Il53]|54]|55][56l, linear Support Vector Machines (SVM), and 
Spectral Regression Discriminant Analysis (SRDA) ifSTl . A 
tuning phase through landscaping identified 10^'^ as the op- 
timal value for the SVM regularizer C on both dataset, and 
10^ and 10^ as the two values for the SRDA parameter a 
respectively on the US and the Sweden cohort (no tuning is 
needed for the DLDA classifier). Furthermore, in the SVM 
case the dataset is standardized to mean zero and variance 
one. The Entropy-based Recursive Feature Elimination (E- 
RFE, Il58l ) ranking algorithm is run on the training portion 
of the cross-validation split and classification models with in- 
creasing number of best ranked features are computed on the 
test part. The performances are evaluated at fixed feature set 
sizes by averaging over the CV replicates the Matthew Cor- 
relation Coefficient (MCC for short, see |[59l) Eq. ^) and 
the Area Under the ROC Curve (AUC for short) by using 
the Wilcoxon-Mann- Whitney formula Eq. (|5]l to extend the 
measure to binary classifiers. In ll60l 16111621 the equivalence 
with other formulations is shown: in particular, it is proved 
that the Wilcoxon-Mann- Whitney formula is an unbiased es- 
timator of the classical AUC. The two performance metrics 
adopted have been chosen because they are generally regarded 
as being two of best measures in describing the confusion ma- 
trix (see Tab. |2]i of true and false positives and negatives by a 
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PrudiL-led 


Positive 


TP 




FP 


value 


Negative 


FN 




TN 



Table 2: Confusion matrix for a binary problem; T/F: 
true/false; TPh-FN: all positive samples, TNh-FP: all negative 
samples. 

single number. MCC's range is [—1,1], where MCC ~ 
corresponds to the no-information error rate, which is, for 
a dataset with P positive samples and N negative samples, 
equivalent to '"'p^_/^"^ ■ MCC=1 is the perfect classification 
(FP=FN=0), while MCC=-1 denotes the worst possible per- 
formance TN=TP=0. 



MCC 



TP • TN - FP ■ FN 



^(TP + FP) (TP + FN) (TN + FP) (TN + FN) 

(4) 

where TN, FR FN, TP as in Tab. |2] . 



AUC 



(5) 



where / classifier, {x^}"^ positive, {a;^}"" negative. 

In Tabs. |4] and |3] we report the performances on SVM and 
SRDA on discrete steps of top ranked features ranging from 
5 to 6144, with 95% bootstrap confidence intervals; for 
comparison purposes we also report AUC values for the 
SRDA classifier in Tab. |5] For the same values k of the 
feature set sizes, the Canberra Core Distance is also computed 
on the top-fc ranked lists as produced by the E-RFE algo- 
rithm: the stability is also shown in the same tables. DLDA 
automatically choses the optimal number of features to use in 
order to maximize MCC (by tuning the internal parameter n/, 
starting from the default value ri/ = 0), thus it is meaningless 
evaluating this classifier on a different feature set size. In 
particular, DLDA reaches maximal performances with one 
feature (which is the same for all replicates, DAP2_5229, 
leading to a zero stability value): the resulting MCC is 0.26 
(CI: (0.18, 0.34)) and 0.16 (CI: 0.12, 0.19) respectively for 
the US and the Sweden cohort. As a reference, 5-CV with 
9-NN (which has higher performance than fc — {5,7,ll})has 
MCC 0.36 on both cohorts with all features. All results are 
displayed in the performance/stability plots of Fig. [3J)- These 
plots can be used as a diagnostic for model selection to detect 
a possible choice for the optimal model as a reasonable com- 
promise between good performances (towards the rightmost 
part of the graph) and good stability (towards the bottom of 
the graph). For instance, in the case shown we decide to use 
SRDA as the better classifier, using 25 features on the Sweden 
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Figure 3: MCC and Canberra Core values computed by using the SRDA, SVM, and DLDA models on the two Setlur datasets. 
Each point indicates a model with a fixed number of features, marked above the corresponding CI line. 
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Table 3: MCC and Core Canberra values for the two Setlur 
datasets for SVM classifiers. 
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0.46 
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0.51 
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Table 4: MCC and Core Canberra values for the two Setlur 
datasets for SRDA classifiers. 



cohort and 10 on the US cohort: looking at the zoomed graph 
in Fig. |3j), if we suppose that the points are describing an 
ideal Pareto front, the two chosen models are the closest to 
the bottom right corner of the plots. The corresponding Borda 
optimal lists for SRDA models on the two Setlur datasets 
are detailed in Tab. |6] 5 probes are common to the two 
list, and, in particular, the top ranked probe is the same. In 
Tab. |2]we list the MCC obtained by applying the SRDA and 
DLDA models on the two Setlur cohorts (exchanging their 
role as training and test set) by using the two optimal Borda 
lists. The probe DAP2_5229 probe seems to have a relevant 
discriminative and predictive importance, as shown by the 
classwise boxplots on the two cohorts of Fig. |4] As detailed 
in GEO http : / /www . ncbi . nlm . nih . gov/ geo| 



and in NCBI Nucleotide DB 

http : / / www . ncbi . nlm . nih . gov/nuccore/' its 
RefSeq ID is NM_004449, whose functional description is re- 
ported as "v-ets erythroblastosis virus E26 oncogene homolog 
(avian) (ERG), transcript variant 2, mRNA" (information up- 
dated on 28 June 2009). In Tab. [8]we show the performances 
obtained by a SRDA and a DLDA model with the sole feature 
DAP2_5229 on all combinations of US and Sweden cohort 
as training and test set. If we consider as the global optimal 
list the list of all 30 distinct features given as the union of 
the Borda Ust in Tab. |6] we get for SRDA and DLDA models 
the performances listed in Tab. [8] To check the consistency 
of the retrieved global list, we run a permutation test: we 
randomly extract 30 features out of the original 6144 features 
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Table 5: AUC values for the two Setlur datasets for SRDA 


classifiers. 
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Table 6: Borda optimal lists for SRDA models on the two 
Setlur datasets. In boldface, probes common to the two op- 
timal lists. In italic, probes included in the 87-gene signature 
of the original paper l36l . 17 probes out of 30 are common to 
the 87-gene signature in | 



DAP2_5229 




-1.5 -1.0 -0.5 0.0 0.5 1.0 



Figure 4: Boxplot of the DAP2_5229 expression value sepa- 
rately for the two Setlur datasets and the two class labels. 
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Table 8: MCC values for SRDA and DLDA models with the 
only feature DAP2_5229 and with the global optimal list. 



and we use as the p-value the number of times the obtained 
performances (DLDA models) are better than those obtained 
with the global optimal list, divided by the total number 10^ 
of experiments. The resulting p-values are less than 10^3 for 
all four combinations of using the two cohorts as training and 
test set, thus obtaining a reasonable significance of the global 
optimal list. Nevertheless if the same permutation test is run 
with the feature DAP2_5229 always occurring in the chosen 
random feature sets, the results are very different: namely, the 
p-value results about 0.1, thus indicating a small statistical 
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Table 7: Setlur dataset. MCC values for SRDA and DLDA 
optimal models. 



significance of the obtained global list. These tests seems to 
indicate that the occurrence of DAP2_5229 plays a key role in 
finding a correct predictive signature. We then performed a 
further experiment to detect the predictive power of the global 
optimal list as a function of its length. We order the global list 
keeping DAP2^229, DAP4^051, DAP0857, DAP3_0905, 
and DAP1_5091 as the first five probes and compute the 
performances of a DLDA model by increasing the number 
of features extracted from the global list from 1 to 30. The 
result is shown in Fig.|5] for many of the displayed models a 
reduced optimal list of about 10-12 features is sufficient to get 
almost optimal predictive performances. A permutation test 
on 12 features (with DAP2_5229 kept as the top probe) gives 
a p-value of 10~2. A final note: our results show a slightly 
better AUC (in training) than the one found by the authors 
of the original paper Il36l . both in the Sweden and in the US 
cohort. Moreover, 17 out of 30 genes included in the global 
optimal list are member of the 87-gene signature shown in the 
original paper. 
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Figure 6: (a): Canberra core evaluated on the Setlur dataset on B=100 repeated filtering experiments on 90% of the data, (b) 
Zoom on the 80%-100% threshold zone. K = 10^ 
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Figure 5: MCC for SRDA and DLDA models on increasing 
number of features extracted from the global list from 1 to 30 
on the Setlur data. 

3.3 Comparing 

The seven filtering algorithms of the previous subsection are 
also applied to the Setlur dataset, by using 100 resampling on 
90% of the data on both the US and Sweden cohort separately. 
The Canberra Core values of the lists at different values of the 



filtering thresholds are shown in Fig. |6] together with a zoom 
on the stricter constraints area: the plots confirm the different 
behaviour of the groups {t, mod-t, SAM} and {F, mod-_F, B} 
and of the singleton FC in both cases. By considering a cut- 
ting threshold of the 75% of the maximal value, we retrieve 14 
sets of ranked partial lists, from which 14 Borda optimal lists 
are computed. In Tab. |9]we list the lengths of the Borda lists 
for each filtering method and cohort. 

As a first rough set-theoretical comparison, we list in Tab. 
I3.3l the probes common to more than three filtering methods. 
We note that only three probes are also appearing in the corre- 
sponding SRDA Borda list. In order to get a more refined in- 
dicator of similarity, we also compute the Core Canberra Dis- 
tances between all Borda optimal lists and between all 75%- 
threshold partial lists for filtering methods, together with the 
corresponding partial and Borda lists for the SRDA models; 
all results are reported in Tab. [TO]By using the Core distances, 
we draw two levelplots (for both distances on Borda lists and 
on the whole partial lists sets, computing also a hierarchical 



F FC modp modi I B SAM 

Sweden i 17 25 759 326 28 366 
US 1 3 6 208 367 7 149 

Table 9: Length of the Borda Usts for different filtering meth- 
ods at 75% threshold on the Setlur dataset. 
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Table 10: Distances between Borda optimal lists (upper triangular matrix) and between all partial lists (lower triangular matrix, 
xlO^) for filtering methods (75% threshold) and SRDA models. Rows and columns 1-8 (Italic): Sweden cohort; rows and 
columns 9-16: US cohort. 
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Figure 7: Levelplot of the distances computed on the lists produced by filtering methods (75% threshold) and SRDA models, 
where the Canberra Distance is computed on (a) their Borda lists; (b) their whole list sets. 



cluster with average linkage and representing also the corre- 
sponding dendrogram in Fig. |7] A further graphical repre- 
sentation of the computed distances has been obtained by us- 
ing a Multidimensional Scaling (MDS) on two components, 
as shown in Fig. 13.31 On both cases, a few facts emerges: on 
both cohorts, the results on the Borda lists and on the whole 
sets of lists are similar, indicating that the Borda method is a 
good way to incorporate information into a single list; the be- 
haviour grouping detected in the previous subsection is essen- 
tially confirmed here. Moreover, the two cohort are quite dif- 
ferent, while the lists coming from the profiling experiments 
are not deeply different from those emerging by the filtering 



methods. 



4 DISCUSSION 

In HI, a correlation between signature congruency and model 
performance in MAQC-11 ifTTl has been detected both in train- 
ing and validation sets: the more similar the signatures, the 
better the average predictions. 

The range of possible applications is clearly not limited to 
the example shown in the present work: in IH |2l El IH the im- 
portance of defining indicators for assessing ranked lists vari- 
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Figure 8: Multidimensional Scaling (MDS) on two components computed on the lists produced by filtering methods (75% 
threshold) and SRDA models, where the Canberra distance is computed on (a) their Borda lists; (b) their whole list sets. 
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Table 1 1 : List of probes common to more than three filtering 
methods 



Il68l . 1691 . but so far no general framework has been structured 
yet. 

5 CONCLUSIONS 

In the present work, an effective method for comparing het- 
erogeneous ranked lists coming from different experiments is 
shown. The algorithm is designed within the framework of the 
theory of metric methods for permutation groups. The intro- 
duced metric can be used in different contexts and for different 
purposes in many aspects of computational biology. A few ex- 
amples of use are shown in the final section of the paper. 



ability is discussed. At least two more applications are worth 
mentioning, metanalysis studies to investigate relationship be- 
tween stability and classification accuracy (see for instance 
ifTTl ) and analysis of lists produced by methods of gene lists 
enrichment. 

As a final consideration, the described method may be- 
come an essential tool towards a theoretical improvement in 
the workfield, that is, the construction of a stability theory 
for feature selection, for instance in the leave-one-out stability 
theory already developed for classifiers ll63l . 1641 . Attempts 
in this direction has been recently carried out ll65l . l66l . l67l . 
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